Probability-based protein identification by searching sequence databases using mass spectrometry data

1999 ◽  
Vol 20 (18) ◽  
pp. 3551-3567 ◽  
Author(s):  
David N. Perkins ◽  
Darryl J. C. Pappin ◽  
David M. Creasy ◽  
John S. Cottrell

2005 ◽  
Vol 11 (2) ◽  
pp. 161-167 ◽  
Author(s):  
Kenton D. Juhlin ◽  
Dionne D. Swift ◽  
Martin P. Lacey ◽  
Paul E. Correa ◽  
Thomas W. Keough

Many laboratories identify proteins by searching tandem mass spectrometry data against genomic or protein sequence databases. These database searches typically use either the measured peptide masses or the derived peptide sequence; in this paper, we focus on the latter. We study the minimum amount of peptide sequence data required for definitive protein identification from protein sequence databases. Accurate mass measurements are not needed for definitive protein identification, even when only a limited amount of sequence data is available for searching. This finding has implications for mass spectrometry performance (and cost), database search strategies, and proteomics research.


2018 ◽  
Author(s):  
K.C.T. Machado ◽  
S. Fortuin ◽  
G.G. Tomazella ◽  
A.F. Fonseca ◽  
R. Warren ◽  
...  

In proteomics, peptide information within mass spectrometry data from a specific organism sample is routinely searched against the protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or poorly genetically characterized, it becomes challenging to determine which database can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and interesting genetic features, such as the existence of pan- and core genes within a species, are revealed, we questioned how efficiently such merging strategies report relevant information. To test this, we constructed databases containing conserved and unique sequences for ten different species. Features that are relevant for probability-based protein identification by proteomics were then monitored. As expected, an increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity or no pangenome at all. This suggests that discrepancies in gene annotation between strains of those species are higher than average. We further tested database performance using mass spectrometry data from eight clinical strains of Mycobacterium tuberculosis and from two published datasets from Staphylococcus aureus. We show that, with an approach in which database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases.
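
The idea of controlling database size by removing repeated identical tryptic sequences can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_len` cutoff, and the toy sequences are hypothetical. The digestion rule is the common simplification of trypsin specificity (cleave after K or R unless the next residue is P), with no missed cleavages.

```python
def tryptic_peptides(protein, min_len=6):
    """In-silico tryptic digest: cleave after K or R, except before P.

    min_len is an assumed cutoff for illustration, not a value from the abstract.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal fragment that does not end in K or R.
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def nonredundant_peptides(strains, min_len=6):
    """Merge strains, keeping each identical tryptic peptide only once."""
    seen = set()
    unique = []
    for proteins in strains.values():
        for protein in proteins:
            for pep in tryptic_peptides(protein, min_len):
                if pep not in seen:
                    seen.add(pep)
                    unique.append(pep)
    return unique


# Toy input: two hypothetical strains whose proteins differ in one residue.
strains = {
    "strain_A": ["MKTAYIAKQRPLVEFGHK"],
    "strain_B": ["MKTAYIAKQRPLVEFGHR"],
}
merged = nonredundant_peptides(strains)
print(merged)
```

In this toy case the peptide shared by both strains is stored once, while the two strain-specific variants are both kept, so the merged search space grows only with sequences that actually differ.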


PROTEOMICS ◽  
2004 ◽  
Vol 4 (3) ◽  
pp. 619-628 ◽  
Author(s):  
Daniel C. Chamrad ◽  
Gerhard Körting ◽  
Kai Stühler ◽  
Helmut E. Meyer ◽  
Joachim Klose ◽  
...  
